Introduction

In this workshop, you’ll be looking at regression in R. Remember that this is a complex topic and we’re only introducing it here. Most importantly, make sure you understand the logic behind these methods. There is some supporting material on Canvas to help you with this. Take the workshop at your own pace and don’t be afraid to take breaks and ask for help.

Correlation or regression?

Whilst regression and correlation are both concerned with associations between numeric variables they are different techniques and each is appropriate under different circumstances. This is a frequent source of confusion. Which technique is required for a particular analysis depends on the way the data were collected and the purpose of the analysis. There are two broad questions to consider:

Where do the data come from?

Think about how the data have been collected. If the data are from an experimental study where one of the variables has been manipulated, then choosing the best analysis is easy. We should use a regression analysis, in which the predictor variable is the experimentally manipulated variable and the response variable is the measured outcome. The fitted line from a regression analysis describes how the outcome variable depends on the manipulated variable - it describes the causal relationship between them.

It is generally inappropriate to use correlation to analyse data from an experimental setting. A correlation analysis examines association but does not imply the dependence of one variable on another. This is why we often say: Correlation does not necessarily imply causation. Since there is no distinction of response or predictor variables, it doesn’t matter which way round we do a correlation. The phrase “which way round” doesn’t even make sense in the context of a correlation.

If the data are from an observational study, either method may be appropriate. Time to ask another question:

What is the goal of the analysis?

Think about what question is being addressed. A correlation coefficient only quantifies the strength and direction of an association between two variables, it tells us nothing about the form of a relationship. Nor does it allow us to make predictions about the value of one variable from the value of a second variable. A regression does allow this because it involves fitting a line through the data - i.e. there’s a model for the relationship.

This means that if the goal of an analysis is to understand the form of a relationship between two variables, or to use a fitted model to make predictions, we have to use regression. If we just want to know whether two variables are associated or not, the direction of the association, and whether the association is strong or weak, then a correlation analysis is sufficient. It is better to use a correlation analysis when the extra information produced by a regression is not needed, because the former will be simpler and potentially more robust.

Linear regression

Linear regression models describe a straight-line relationship between two numeric variables, i.e. they describe how the response variable changes with the values of the predictor variable. It is conventional to label the response variable as ‘y’ and the predictor variable as ‘x’. When we present such data graphically, the response variable always goes on the y-axis and the predictor variable on the x-axis. Try not to forget this convention!

How do we decide which is to be used as the response variable and which as the predictor variable? The decision is fairly straightforward in an experimental setting: the manipulated variable is the predictor variable, and the measured outcome is the response variable. Things may not be so clear cut when we are working with data from an observational study as it may not be obvious that one variable depends upon the other (in a causal sense). In regression it matters which way round we designate the response and predictor variables. If you have two variables A and B, the relationship you find from a regression will not be the same for A against B as for B against A.

Data

As an example, I’m using some data about three species of penguin from the Palmer archipelago in Antarctica. To use these data, you’ll need to install the palmerpenguins package.

install.packages("palmerpenguins")
penguins <- palmerpenguins::penguins

It may seem reasonable to hypothesise that penguins with a larger body mass will also have longer flippers. Furthermore, we might hypothesise that larger flippers are the result of being a larger animal. We can use linear regression to assess this potential relationship.

First, it’s a good idea to plot the data and get a feel for what it looks like.
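If you want to produce this plot yourself, here is one way to do it (a sketch using the ggplot2 package, which you may need to install first):

```r
library(ggplot2)
library(palmerpenguins)

# Scatterplot of the raw data: predictor on the x-axis, response on the y-axis
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  labs(x = "Body mass (g)", y = "Flipper length (mm)")
```

(You will get a warning about a couple of rows with missing values being removed; that’s expected with these data.)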

These data appear to satisfy the key assumption of linear regression: that the relationship can be described by a linear model (a straight line). Note that body mass is on the x-axis because this is our predictor variable. It is important to get this the right way round!

Step 1 - Fit the model

The first step of any regression is to fit the model. In the case of linear regression, this means calculating the line of best fit. In R, we use a function that you are already familiar with: lm().

m1 <- lm(flipper_length_mm ~ body_mass_g, data = penguins)

This should look quite familiar. We have to assign two arguments:

  1. The first argument is a formula. We know this because it includes a ‘tilde’ symbol: ~. The variable name on the left of the ~ should be the response variable. The variable name on the right should be the predictor variable. These are flipper length and body mass, respectively. Make sure you get these the right way round.

  2. The second argument is the name of the data frame that contains the two variables listed in the formula (penguins).

By inspecting the object we have created, we can view the values for the slope (b) and intercept (a). Remember that the general form of the linear model is \(y = a + bx\) so these values describe the line of best fit.
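You can also pull these values out of the fitted model object directly with coef(), which is handy if you want to use them in further calculations:

```r
# Extract the estimated coefficients from the fitted model
coefs <- coef(m1)
coefs["(Intercept)"]   # a, the intercept
coefs["body_mass_g"]   # b, the slope
```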

m1

Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = penguins)

Coefficients:
(Intercept)  body_mass_g  
  136.72956      0.01528  

We can plot this line over the data on our scatterplot.
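One quick way to do this in base R is with abline(), which can take a fitted lm object and draw its line of best fit over an existing scatterplot:

```r
# Scatterplot of the data, then overlay the fitted regression line
plot(flipper_length_mm ~ body_mass_g, data = penguins,
     xlab = "Body mass (g)", ylab = "Flipper length (mm)")
abline(m1)   # uses the intercept and slope stored in m1
```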

Step 2: Assess the model

A detailed summary of the properties of our model can be called up using the summary() function.

summary(m1)

Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.7626  -4.9138   0.9891   5.1166  16.6392 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.367e+02  1.997e+00   68.47   <2e-16 ***
body_mass_g 1.528e-02  4.668e-04   32.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.913 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

This output is a little more complex than you may be used to, so let’s go through it piece by piece.

First we have the “Call” and “Residuals” sections. There’s nothing much we need here so we won’t dwell on it. The Call is simply the code that was used to fit the model and as long as it matches what we expect, that’s all we need from it. The Residuals just gives us some summary stats about the residuals of our model (see the lecture videos for more information about residuals). Again, no need to pay much attention to these here.

The “Coefficients” section contains much more useful information.

The “Estimate” column shows us the estimated intercept and slope of the model. We saw these earlier when we printed the fitted model object to the console.

Staying with this table, the next three columns (Std. Error, t value and Pr(>|t|)) show us the standard error associated with each coefficient, the corresponding t-statistics, and the p-values. Remember standard errors? These are a measure of the variability of the sampling distribution associated with something we estimate from a sample. You have met these before in the context of sample means. One can calculate a standard error for many different kinds of quantities, including the intercept and slope of a regression model. And just as with a mean, we can use the standard errors to evaluate the significance of the coefficients via t-statistics.

In this case, the p-values associated with these t-statistics indicate that the intercept is significantly different from zero (p<0.001), and that the slope is significantly different from zero (p<0.001).

If you’re interested in visualising what the standard error of a regression means, there is a way to plot it. The shading around the regression line here signifies the standard error.
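If you are working in ggplot2, geom_smooth() with method = "lm" is one way to produce this kind of plot: it draws the fitted line with a shaded standard-error ribbon around it.

```r
library(ggplot2)
library(palmerpenguins)

# Fitted line plus shaded standard-error ribbon (se = TRUE is the default)
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
```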

Finally, we have some vital model stats, in particular the R2.

This shows the R-squared (R2) of our model. It tells you what proportion (sometimes expressed as a percentage) of the variation in the data is explained, or accounted for, by the fitted line. If R2 = 1 the line passes through all the points on the graph (all the variation is accounted for) and if R2 = 0 the line explains none of the variation in the data. The R2 value here is 0.76, indicating that 76% of the variation is explained by this model. This is very respectable, but still indicates that there are other sources of variation which remain unexplained by the line.
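Both of these statistics are stored in the summary object, so you can extract them directly rather than reading them off the printed output:

```r
s <- summary(m1)
s$r.squared    # proportion of variation explained by the model
s$fstatistic   # F value plus its numerator and denominator degrees of freedom
```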

We also have the F statistic which is used to calculate the p-value. This assesses the significance of the overall fit of the model.

Reporting a regression

There is a significant positive relationship between body mass (g) and flipper length (mm) in the Palmer penguins (y=136.73 + 0.015x, F = 1071, d.f. = 1,340, p < 0.001).

Don’t forget to quote both degrees of freedom in the result. These are obtained from the summary information and should be given as the slope degrees of freedom first (which is always 1), followed by the error degrees of freedom.

If the results are being presented only in the text it is usually appropriate to specify the regression equation as well as the significance of the relationship as this allows the reader to see in which direction and how steep the relationship is, and to use the equation in further calculations. It may also be useful to give the units of measurement (though these should already be stated in the Methods). Often however, we will want to present the results as a figure, showing the original data and the fitted regression line. In this case, most of the statistical detail can go in the figure legend instead.

Linear regression questions

A plant physiologist studying the process of germination in the broad bean (Vicia faba) is interested in the relationship between the activity of the enzyme amylase, and the temperature at which the germinating beans are kept. As part of this work she carries out an experiment to find the relationship between glucose release (from the breakdown of starch by amylase) and temperature (over the range 2–20 °C). The data obtained from such an experiment are given in the file glucose.csv.

Is there a statistically significant relationship between temperature and glucose release (and hence, presumably, amylase activity)?
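As a starting point, you might read the file in and fit the model along these lines. Note that the column names ‘temperature’ and ‘glucose’ here are assumptions for illustration; check the actual names with head() before fitting.

```r
glucose <- read.csv("glucose.csv")
head(glucose)   # check the column names before going further

# Hypothetical column names: glucose release as the response,
# temperature as the (experimentally manipulated) predictor
m_glu <- lm(glucose ~ temperature, data = glucose)
summary(m_glu)
```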

Non-linearity

Sometimes, no matter how many ritual sacrifices you make to the data gods, you collect data that isn’t linear. There are two options for dealing with non-linear data.

  1. Transform to make the data linear
  2. Fit a non-linear model

Transforming Data

One way to deal with non-linear data is to run away and watch Star Trek Discovery on Paramount+. Another, arguably more useful way, is to use a mathematical transformation to force it to be linear.

Let’s see how this works with some graphical examples. You can see on the left a line describing some data that are very clearly non-linear. In fact, these data follow a square relationship as shown in the equation on the graph. The graph on the right is exactly the same data but in this case, I have taken the square root of the equation. As you can see, the result is a clear, linear relationship.

Most importantly, when we transform the non-linear equation, we get something that looks much more like the general form of the linear equation (\(y = bx + a\)).

\[ y = (bx + a)^2\] \[ \sqrt y = bx + a\]

One very common type of non-linearity in real data is logarithmic data. We can see again here that if we apply the right transformation (in this case a log transformation), we can turn a non-linear relationship into a linear one. Once we have achieved this, we can simply employ a linear regression on the transformed data.

\[ y = ax^b\] \[ log(y) = log(a) + b~log(x) \]
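To see this in action, here is a small simulated example (the data are made up purely for illustration). We generate data from \(y = ax^b\) with some noise, then fit an ordinary linear regression to the logged variables; the fitted intercept and slope should recover \(log(a)\) and \(b\).

```r
set.seed(42)
a <- 2
b <- 0.75
x <- runif(100, min = 1, max = 50)
y <- a * x^b * exp(rnorm(100, sd = 0.1))   # multiplicative noise

# On the log scale the relationship is linear: log(y) = log(a) + b*log(x)
m_log <- lm(log(y) ~ log(x))
coef(m_log)   # intercept should be close to log(2), slope close to 0.75
```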

Marsupial example

Let’s look at an example from some real data. Here we have data on brain and body size in two groups of animals: Cetaceans and marsupials.

brains <- read.csv("Braindata.csv", header = TRUE)
head(brains)

  Order                    Species Brain.size Body.size
Cetacea         Balaena_mysticetus       2738  79691179
Cetacea Balaenoptera_acutorostrata         NA   5587094
Cetacea   Balaenoptera_bonaerensis         NA        NA
Cetacea      Balaenoptera_borealis       4900  22106252
Cetacea         Balaenoptera_edeni         NA  20000000
Cetacea      Balaenoptera_musculus       3636 154321305

Let’s start by focusing on the marsupials by subsetting them out of the larger dataset. We can see when we plot the data that what we have here looks both nonlinear and skewed. You can see from the plot below that both brain size and body mass are heavily skewed towards the smaller values. Similarly, the scatterplot shows the data is very clearly non-linear.
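A sketch of the subsetting step. The label "Marsupialia" in the Order column is an assumption here; check the actual labels in your copy of the data with unique(brains$Order) first.

```r
# Keep only the rows whose Order matches the (assumed) marsupial label
marsupials <- subset(brains, Order == "Marsupialia")
```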

Let’s log transform both variables and see if that helps.

marsupials$log.brain <- log10(marsupials$Brain.size)
marsupials$log.body <- log10(marsupials$Body.size)

As you can see, the transformed data is much easier to work with. The data looks linear and the distributions look more evenly spread. The distributions of the data don’t have to be normal to perform a regression by the way. But very strongly skewed data as we saw in the untransformed data introduces problems with fitting the model. For example, you can see in the untransformed data that it is very clearly heteroscedastic (variance changes across the sample). This means that if I were to try and fit a model to those data, the residuals would be much greater at higher values of x and y. As you can see from the plots below, this means the standard errors of the model (the shaded area around the fitted line) get much larger at higher values of body and brain size. This is the case whether you fit a linear (left) or non-linear model (right).

Now that we have linear data, we can analyse it with linear regression. Have a go and see if you can interpret the results, using the linear regression code from earlier.

Afterword

There is another way to go and this would be to fit a different kind of model. You’ll be learning more about other types of regression in future topics.